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Abstract 


We provide a pedagogical introduction to the two main variants of real-space quantum Monte 
Carlo methods for electronic-structure calculations: variational Monte Carlo (VMC) and dif¬ 
fusion Monte Carlo (DMC). Assuming no prior knowledge on the subject, we review in depth 
the Metropolis-Hastings algorithm used in VMC for sampling the square of an approximate 
wave function, discussing details important for applications to electronic systems. We also re¬ 
view in detail the more sophisticated DMC algorithm within the fixed-node approximation, 
introduced to avoid the infamous Fermionic sign problem, which allows one to sample a more 
accurate approximation to the ground-state wave function. Throughout this review, we discuss 
the statistical methods used for evaluating expectation values and statistical uncertainties. In 
particular, we show how to estimate nonlinear functions of expectation values and their statis¬ 
tical uncertainties. 


Keywords: quantum Monte Carlo, electronic-structure calculations, Metropolis-Hastings algo¬ 
rithm, fixed-node approximation, statistical methods. 
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This chapter provides a pedagogical introduction to the two main variants of real-space 
quantum Monte Carlo (QMC) methods for electronic-structure calculations: variational Monte 
Carlo (VMC) and diffusion Monte Carlo (DMC). For more details of these methods, see, e.g.. 
Refs. [1, 2, 3, 4, 5, 6]. For reviews on applications of QMC methods in chemistry and condensed- 
matter physics, see, e.g.. Refs. [7, 8]. 


1 Variational Monte Carlo 


1.1 Basic idea 


The idea of the VMC method [9, 10] is simply to calculate the multidimensional integrals ap¬ 
pearing in quantum mechanics using a Monte Carlo numerical integration technique^. The 
quantity of greatest interest is the variational energy associated with a Hamiltonian H and a 
wave function 'h, which can be written as 


(^|T) 


f dRT(R)2EL(R) 
/dRT(R)2 


= j dKpiR)EL(R), 


( 1 ) 


where £'l(R) = (-f^'I'(It))/T(R) is the local energy depending on the 3N coordinates R of 
the N electrons, and /^(R) = 'I'(R)^/ f dR'I'(R)^ is the normalized probability density. For 
simplicity of notation we have assumed that il'(R) is real valued; the extension to complex ih(R) 
is straightforward. The variational energy can be estimated as the average value of £'l(R) on a 
sample of M points R^ sampled from the probability density p(R), 


~ Ei^ 


1 

M 


M 


k=l 


( 2 ) 


In practice, the points R^ are sampled using the Metropolis-Hastings algorithm [12, 13]. 

The advantage of this approach is that it does not use an analytical integration involving the 
wave function, and thus does not impose severe constraints on the form of the wave function. 
The wave functions usually used in QMC are of the Jastrow-Slater form. 


T(R) = J(R)4>(R), (3) 

where J(R) is a Jastrow factor and 4>(R) is a Slater determinant or a linear combination of 
Slater determinants^. The Jastrow factor is generally of the form J(R) = It depends 

explicitly on the interparticle distances Vij, allowing for an efficient description of the so-called 
electron “dynamic” correlation. 

In practice, the VMC method has two types of errors: 


• a systematic error, due to the use of an approximate wave function (as in other wave- 
function methods), 

^To the best of our knowledge, the first calculation of multidimensional integrals appearing in quantum me¬ 
chanics by using Monte Carlo methods was done by Conroy [11]. 

^In QMC, it is convenient to use wave functions in which the values of the spin coordinates have been fixed, 
so '5 is a function of the spatial coordinates R only. 
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• a statistical uncertainty, due to the sampling of finite size M (which is specific to Monte 
Carlo methods). 

Of course, the variational energy is an upper bound of the exact ground-state energy, but 
the systematic error is generally unknown, since its determination requires knowing the exact 
solution. By contrast, the statistical uncertainty can be easily estimated by the usual statistical 
techniques. For this, let us examine more closely the meaning of Eq. (2). The average of the 
local energy El on a finite sample is itself a random variable, taking different values on different 
samples. The central limit theorem establishes that, if EL(Rfc) are random variables that are 
independent (i.e. not correlated) and identically distributed, with finite expected value E[El] and 
finite variance, V[El] = E[(El — Ev)^], then in the large M limit the probability distribution of 
the random variable El converges (in the mathematical sense of convergence in distribution) to 
a Gaussian (or normal) distribution of expected value E[El] and variance V[El]/M, 

E [El] = E[El] = E„ (4a) 

V [El] = (4b) 

This means that El is an estimator of with a statistical uncertainty which can be defined 
by the standard deviation of its Gaussian distribution 

[El] = v^V [El] = (5) 

The meaning of this standard deviation is that the desired expected value Ev has a probability of 
68.3% of being in the interval [El — a, Ei^ + a], a probability of 95.5% of being in the interval 
[El — 2a, Ei^ + 2(t] , and a probability of 99.7% of being in the interval [El — 3(T, El -|- 3(j] . 
Note that, if the variance V[El] is infinite but the expected value E[El] is finite, then the law of 
large numbers guarantees the convergence of El to E[El] when M —)• oo but with a statistical 
uncertainty which is more difficult to estimate and which decreases more slowly than 1/\/M. 

It is important to note that the statistical uncertainty decreases as 1/y/M independently of 
the dimension of the problem. This is in contrast to deterministic numerical integration methods 
for which the convergence of the integration error deteriorates with the spatial dimension d. For 
example, Simpson’s integration rule converges as (provided the integrand has up to 

4t/i-order derivatives), so that for d > 8 Monte Garlo methods are more efficient for large M. 

The statistical uncertainty is reduced if the variance of the local energy V[El] is small. In 
the limit that T is an exact eigenfunction of H, the local energy El becomes exact, independent 
of R, and thus its variance V[El] and the statistical uncertainty of El vanish. This is known 
as the zero-variance property. Since the systematic error (or bias) of the variational energy 
AE = Ey — Eo (where Eq is the exact energy) also vanishes in this limit, there is a zero-bias 
property as well. For these reasons, a great deal of effort has been expended on developing 
robust and efficient wave-function optimization methods. 

1.2 Estimation of the statistical uncertainty 

In practice, the probability density /^(R) is sampled with the Metropolis-Hastings algorithm 
which provides a sequence of points R^ correctly distributed according to /?(R) but sequentially 
(or serially) correlated (i.e. non independent). This is a consequence of each point being 
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sampled from a probability distribution conditional on the previous point. One can define an 
autocorrelation time (defined more precisely later) that is roughly speaking the average time for 
points to decorrelate. This sequential correlation must be taken into account when using the 
central limit theorem for evaluating the statistical uncertainty. This is done using the blocking 
technique, which is described next. 

Let us consider a sequence of M realizations Xk (sequentially correlated) of a random variable 
X of expected value E[X] and of variance V[X]. For example, X could be the local energy Fl. 
We divide this sequence into Mb successive blocks of Mg steps each. The block average X]^ is 

_ 1 

® k=i 


The expected value of Xb is also the expected value of X, i.e. E [Xb] = E[X], but its variance 
is not simply V[X]/Ms since the variables X^ are not independent. We can now dehne the global 
average X of the whole sample as the average over all the blocks of the block averages 


X = 


T Mb 


b=l 


(7) 


where Xi, with a math subscript “6” indicates the block average for the 6*^ block (whereas Xb 
with a Roman subscript “b” indicates the generic random variable). The global average X is 
another random variable with the same expected value as X, i.e. E [X] = E [Xb] = E[X]. If 
the length of the blocks is large compared to the autocorrelation time then the block averages 
Xb can be considered as being independent, and the variance of the global average is simply 


V[X] 


V[Xb] 

Mb 


which leads to the statistical uncertainty of X 


a[X]= yvlxl = 



In practice, the statistical uncertainty on a finite sample is calculated as 


.[X] 


\ 


1 


1 


Mb 


Mb 


Mb - 1 \ Mb 


Ex.' 


6=1 


Mb 


Ex. 

6=1 


( 8 ) 


(9) 


( 10 ) 


where the Mb — 1 term appearing instead of Mb is necessary to have an unbiased estimator of 
the standard deviation on the sample (see the appendix). It takes into account the fact that the 
computed variance is relative to the sample average rather than the true expected value. 

Finally, let us examine the variance V [Xb] • Since the variables X^ are not independent, the 
expansion of V [Xb] involves the covariances between the variables 

V [X.] = E Cov[W, X,1 = Ml + ^ ^ Cov[X., X,1 = T,M1. (11) 

^ k,I S s S 
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defining the autocorrelation time of X 


= ( 12 ) 

The autocorrelation time is equal to 1 in the absence of correlation between the variables, i.e. 
Cov[Xfc,X;] = 0 for k ^ I, but can be large in the presence of sequential correlation. It is 
instructive to express the statistical uncertainty as a function of Tc 


cr[X] 



VW 

Ms Mb 



(13) 


where M = MgMb is the total size of the sample. The expression (13) allows one to interpret 
Tc as a factor giving the number of effectively independent points in the sample, Meff = M/Tc. 
In practice, it is useful to calculate the autocorrelation time as Tc = MgV [^b] /V[^] and 
check whether the length of the blocks is large enough for a correct estimation of the statistical 
uncertainty, e.g. Mg > 100 Tc. If Mg is not much greater than Tc, then the statistical uncertainty 
a [X] and the autocorrelation time Tc will be underestimated. 

In the appendix, we further explain how to estimate the statistical uncertainty of nonlinear 
functions of expectation values, which often occur in practice. 


1.3 Calculation cost 

The calculation cost required to reach a given statistical uncertainty a is 

t = t,M = t.hhS (14) 

where ts is the calculation time per iteration. The l/cr [Xj ^ dependence implies that decreasing 
the statistical uncertainty by a factor of 10 requires to increase the computational time by a 
factor of 100. This quadratic dependence directly stems from the central limit theorem and 
seems unavoidable^. However, one can play with the three other parameters: 

• Tc depends on the sampling algorithm and on the random variable X. For efficient algo¬ 
rithms such as Umrigar’s one [15, 3], the autocorrelation time of the local energy is close 
to 1 and little further improvement seems possible; 

• tg is usually dominated by the cost of evaluating X. For the local energy, the evaluation 
cost depends on the form of the wave function; 

• V[X] depends on the choice of the random variable X with its associated probability 
distribution, the only constraint being that the expected value E[X] must equal the ex¬ 
pectation value of the observable (otherwise, this is a biased estimator). The choice of 
a good probability distribution is usually called importance sampling. Even for a fixed 
probability distribution, it is possible to use various estimators for X, some of which have 

^Quasi Monte Carlo methods [14] can in some cases achieve a convergence rate of 0{ln{M) /M) rather than 
However, they have not been used for QMC applications, in part because in QMC the sampled 
distributions, for systems with more than a few electrons, are very highly peaked. 
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smaller variance than others, since one has the freedom to add any quantity with zero 
expectation value. This has been exploited to construct improved estimators for diverse 
observables [16, 17, 18, 19, 20]. There is often a compromise to be found between a low 
computation time per iteration tg and a low variance V[X]. 


1.4 Sampling technique 

The probability density, /o(R) = 'I'(R)^// dR'I'(R)^, is generally complicated and cannot be 
sampled by direct methods such as the transformation method or the rejection method. Instead, 
the Metropolis-Hastings (or generalized Metropolis) algorithm, which can be used to sample any 
known probability density, is used. It employs a stochastic process, more specifically, a Markov 
chain. 


Stochastic process 

A stochastic process represents the evolution - say in “time” - of a random variable. It is 
described by a trajectory of successive points Ri, R 2 , ..., Rm with an associated probability 
distribution P(Rm, •••, R 2 ) Ri)- The idea of evolution in time can be made more explicit by 
decomposing the probability of the whole trajectory in products of the conditional probability 
of having a particular point knowing that all the previous points have already been realized. For 
example, for M = 3, the probability of the trajectory is 

P(R3,R2,Ri) = P(R3|R2,Ri)F’(R2|Ri)R(Ri). (15) 


Markov chain 

A Markov chain is a stochastic process for which the conditional probability for the transition 
to a new point Rfc depends only on the previous point Rfc-i 

P(Rfc|Rfc_i, ...,Ri) = P(Rfc|Rfc_i), (16) 

i.e. the process “forgets” the way it arrived at point Rfc_i. The probability of a trajectory can 
thus be simply written as, e.g. for M = 3, 

P(R3,R2,Ri) = P(R3|R2)T(R2|Ri)P(Ri), (17) 

and P(Rf|Ri) is called the transition probability from point R, to point Rf. Note that, in 
general, the transition probability can depend on time (measured by the index k). We will 
consider here only the case of a stationary Markov chain for which the transition probability is 
time independent. 

In the following, we will use notation corresponding to the case of states R^ in a continuous 
space (“integrals” instead of “sums”), but we will ignore the possibly subtle mathematical dif¬ 
ferences between the continuous and discrete cases, and we will often use the vocabulary of the 
discrete case (e.g., “matrix”). The transition probability matrix, P is a stochastic matrix, i.e., 
it has the following two properties: 

P(Rf|Ri) > 0 (non negativity), (18a) 
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dRfP(Rf|Ri) = 1 (column normalization). 


(18b) 


The second property expresses the fact that the probability that a point Ri is somewhere at the 
next step must be 1. The eigenvalues of a stochastic matrix are between 0 and 1, and there is 
at least one eigenvalue equal to 1. The latter property is a consequence of the fact that, for a 
column-normalized matrix, the vector with all components equal to one is a left eigenvector with 
eigenvalue 1. The target probability distribution /?(R) is sampled by constructing a Markov chain 
converging to /9(R). A necessary condition is that the distribution /9(R) is a (right) eigenvector 
of P(Rf|Ri) with the eigenvalue 1 

I dRiP(Rf|Ri)p(Ri) = p(Rf) = J dRiP(Ri|Rf)p(Rf) VRf, (19) 

where the second equality simply comes from the normalization condition (18b). Eq. (19) is a 
stationarity condition for p(R). It means that if we start from the target distribution p(R) then 
we will continue to sample the same distribution by applying the Markov chain. However, we 
need more than that. We want that any initial distribution pini(R), e.g., a delta function at 
some initial point, evolves to the target stationary distribution /o(R) by repeated applications 
of the transition matrix 

hm /dRiP“(R|Ri)pini(Ri) = 

M—^oo J 

RM-l)---T’(R 2 |Rl)pini(Rl) = p(R), ( 20 ) 

i.e. p(R) must be the dominant eigenvector of P (the unique eigenvector of largest eigenvalue). 
If the stationarity condition (19) is satisfied then this will always be the case except if P has 
several eigenvectors with eigenvalue 1. One can show that the matrix P has only one eigenvector 
of eigenvalue 1 if P is a primitive matrix, i.e. if there is an integer n > 1 such that all the elements 
of the matrix P^ are strictly positive, P’^(Rfc|R;) > 0, V Rfc,R;. This means that it must be 
possible to move between any pair of states R^ and R/ in n steps. This ensures that all states 
can be visited, and that the Markov chain converges to the unique stationary distribution /^(R). 
The Markov chain is then said to be ergodic. 

In practice, instead of imposing the stationarity condition (19), the Markov matrix is con¬ 
structed by imposing the more stringent detailed balance condition. 


lim / dRidR 2 ...dRM T’(R|Rm)T’(Rm 

M—>-CXD J 


P(Rf|Ri)p(Ri) = P(Ri|Rf)p(Rf), 


( 21 ) 


which forces the probability flux between the two states Ri and Rf to be the same in both 
directions. This is a sufficient (but not necessary) condition for /^(R) to be the stationary 
distribution. A Markov chain satisfying condition (21) is said to be reversible. 

In practice, a Markov chain is realized by a random walk. Starting from an initial point 
Ri (or walker) - i.e. a delta-function distribution (5(R — Ri) - sample the second point R 2 by 
drawing from the probability distribution P(R 2 |Ri), then a third point R 3 by drawing from 
P(R 3 |R 2 ), and so on. After disregarding a certain number of iterations Meq corresponding 
to a transient phase called equilibration, the random walk samples the stationary distribution 
/ 9 (R) in the sense that /o(R) = E[(5(R — Rfc)] ~ (1/Af) 5(R — Rfc ) and the averages of the 

estimators of the observables of interest are calculated. The rate of convergence to the stationary 
distribution /o(R) and the autocorrelation times of the observables are determined by the second 


largest eigenvalue of the matrix P (see, e.g., Ref. [21]). The random walk must be sufficiently 
long so as to obtain a representative sample of the states making a non negligible contribution 
to the expected values. If the transitions between states belonging to two contributing regions of 
the space of states are too improbable, as may happen for example for dissociated atoms, then 
there is a risk that the random walk remains stuck in a region of space and seems converged, even 
though the true stationary distribution is not yet reached. To avoid this problem, smart choices 
for the transition matrix can be crucial in various applications of Monte Carlo methods [22, 23]. 


Metropolis-Hastings algorithm 

In the Metropolis-Hastings algorithm [12, 13], one realizes a Markov chain with the following 
random walk. Starting from a point Ri, a new point Rf is determined in two steps: 


1 . 

2 . 


a temporary point R[ is proposed with the probability Pprop(Rf|Ri), 


the point R[ is accepted (i.e. Rf = R[) with probability Pacc(Rf|R-i), or rejected (i.e. 
Rf = Ri) with probability Prej(R-f|Ri) = 1 — Pacc(Rf|Ri) 


The corresponding transition probability can be written as 


P(Rf|Ri) 


Pace (Rf|Ri)Pprop(Rf|Ri) if Rf / Ri 

1 - / dR[ Pacc(R'f|Ri)Pprop(R'f|Ri) if Rf = Ri 


or, in a single expression. 


( 22 ) 


P(Rf|Ri) = Pacc(Rf|Ri)Pprop(Rf|Ri) + 



dRf Pacc(Rf|Ri)Pprop(Rf|Ri) 


S(Ri 


Rf). (23) 


The proposal probability Pprop(Rf|Ri) is a stochastic matrix, i.e. Pprop(Rf|Ri) > 0 and 
f dRfPprop(Rf|Ri) = 1, ensuring that P(Rf|Ri ) fulfils the non-negativity condition (18a). The 
second term in Eq. (23) with the delta function ensures that P(RfjRi) fulfils the normaliza¬ 
tion condition (18b). The acceptance probability is chosen so as to fulfil the detailed balance 
condition (21), for Rf / Rf, 


Pacc(Rf|Ri) Pprop (RlRf)p(Rf) 

Pacc(Ri|Rf) Pprop(Rf|Ri)/0(Ri) 


Several choices are possibles. The choice of Metropolis et al. [12] maximizes the acceptance 
probability 


Pacc(Rf|Ri) 


■ f 1 -Pprop(Ri|Rf)p(Rf) ) 

1 ’ Pp,op(Rf|Ri)/9(Ri) / ■ 


(25) 


The acceptance probability is not a stochastic matrix, even though both the proposal and the 
total Markov matrices are stochastic. Since only the ratio p(Rf)//9(Ri) is involved in Eq. (25), 
it is not necessary to calculate the normalization constant of the probability density p(R). It is 
clear that the acceptance probability of Eq. (25) is optimal, but there is considerable scope for 
ingenuity in choosing a proposal probability Pprop (Rf|Ri) that leads to a small autocorrelation 
time. 
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Choice of the proposal probability 


The original paper of Metropolis et al. [12] employed a symmetric proposal matrix, in which 
case the proposal matrix drops out of the formula for the acceptance. The advantage of having 
a nonsymmetric proposal matrix was pointed out by Hastings [13]. One has a lot of freedom in 
the choice of the proposal probability Pprop(R-flR-i)- The only constraints are that Pprop(HflRi) 
must be a stochastic matrix leading to an ergodic Markov chain and that it must be possible 
to efficiently sample Pprop(H,flHi) with a direct sampling method. The proposal probability 
determines the average size of the proposed moves Ri —)■ Rf and the average acceptance rate 
of these moves. In order to reduce sequential correlation, one has to make moves as large as 
possible but with a high acceptance rate. In practice, for a given form of the proposal matrix, 
there is a compromise to be found between the average size of the proposed moves and the 
average acceptance rate. 

The simplest choice for Ppi.op(Rf|Ri) is a distribution that is uniform inside a small cube 
ll(Ri) centered in Ri and of side length A and zero outside 


Hprop(Rf|Ri) 


^ ifRfefl(Ri) 
0 elsewhere . 


(26) 


In practice, a move according to Eq. (26) is proposed, 


Rf = Ri + y X, 


(27) 


where x is a vector of 2>N random numbers drawn from the uniform distribution between — 1 
and 1. The size of the cube A can be adjusted so as to minimize the autocorrelation time of the 
local energy, but the latter remains large and the sampling is inefficient. 


Clever choices use information from the distribution /9(R), in particular its local gradient, 
to guide the sampling. A choice for Pprop(Rf|Ri) which would lead to large moves with an 
acceptance probability equal to 1 would be Pprop(RflRi) = ^(Rf), independently from Rf, but 
we would then be back to the initial problem of sampling a complicated distribution /o(R). A 
good choice for Ppi.op(Rf|Ri) is the Green function of the Fokker-Planck equation in the short- 
time approximation 


^prop(R-flR'i) 


1 (Rf-Ri-v(Ri)-r)^ 

(27rr)'^/^" 


(28) 


where v(R) = V'I'(R)/'I'(R) is called the drift velocity of the wave function and r is the time 
step which can be adjusted so as to minimize the autocorrelation time of the local energy. In 
practice, a move according to Eq. (28) is proposed 


Rf = Ri -I- v(Ri)T -I- 77, ( 29 ) 

where 77 is a vector of 3N random numbers drawn from the Gaussian distribution of average 0 
and standard deviation ^/T. The term 77 describes an isotropic Gaussian diffusion process (or 
Wiener process). The term v(Ri)r is a drift term which moves the random walk in the direction 
of increasing j'k(R)j. 

The optimal size of the move is smaller in regions where v(R) is changing rapidly. For 
example, v(R) has a discontinuity at the nuclear positions. Hence, it is more efficient to make 
smaller moves for electrons in the core than for electrons in the valence regions. In doing this, 
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care must be taken to ensure the detailed balance condition. An elegant solution is provided in 
the VMC algorithm of Refs. [15, 3] where the electron moves are made in spherical coordinates 
centered on the nearest nucleus and the size of radial moves is proportional to the distance to 
the nearest nucleus. In addition, the size of the angular moves gets larger as one approaches 
a nucleus. This algorithm allows one to achieve, in many cases, an autocorrelation time of the 
local energy close to 1. 


Expectation values 

The expectation value of an operator O can be computed by averaging the corresponding local 
value 0(Rf) = (Rf|0|T)/T(Rf) at the Monte Carlo points Rf after the accept/reject step. A 
somewhat smaller statistical error can be achieved by instead averaging 

Race (RfiRi) 0(Rf) + (1 

Pa.ee (RfiRi)) 0(Ri), (30) 

regardless of whether the proposed move is accepted or rejected. 


Moving the electrons all at once or one by one? 

So far we have assumed that, for a many-electron system, all the electrons are moved and then 
this move is accepted or rejected in a single step. In fact, it is also possible to move the electrons 
one by one, i.e. move the first electron, accept or reject this move, then move the second electron, 
accept or reject this move, and so on. In this case, the transition probability for N electrons 
can be formally decomposed as 

P(Rf|Ri) = P(riyr2,f...r7v,f|riyr2,f...rAr,i) x 

... X P(ri^fr2,f...rAr,i|riyr2,i...rAr,i) x P(riyr 2 ,i...rAr,i|ri,ir 2 ,i...rAr,i), (31) 

where each one-electron transition probability (knowing that the other electrons are fixed) is 
made of a proposal probability and an acceptance probability just as before. If each one- 
electron transition probability satisfies the stationary condition (19), then the global transition 
probability satisfies it as well. 

Moving the N electrons one by one requires more calculation time than moving the electrons 
all at once, since the wave function must be recalculated after each move to calculate the 
acceptance probability. The calculation time does not increase by a factor of N as one may 
naively think but typically by a factor of 2 if the value of the wave function is recalculated in 
a clever way after an one-electron move. For example, for Slater determinants, one can use 
the matrix determinant lemma in conjunction with the Sherman-Morrison formula (see, e.g., 
Ref. [24]) to efficiently recalculate the values of the determinants when only one row or column 
has been changed. In spite of the increase in the calculation time, it has been repeatedly shown in 
the literature (see, e.g., Refs. [10, 15, 25, 26]) that, for systems with many electrons, moving the 
electrons one by one leads to a more efficient algorithm; larger moves can be made for the same 
average acceptance, so the points R^ are less sequentially correlated and the autocorrelation 
time of the local energy is smaller (by a factor larger than the one necessary for compensating 
the increase of the calculation time per iteration). 
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2 Diffusion Monte Carlo 


2.1 Basic idea 


While the VMC method is limited by the use of an approximate wave function '1, the idea of 
the DMC method [27, 28, 29, 5, 30] is to sample from the exact wave function 'I'o of the ground 
state of the system. If we have this exact wave function 'I'O) then the associated exact energy 
Eq can be obtained from the mixed expectation value using the trial wave function 'I, 


(^'ol^l^) _ f dR^o(R)^(R)^L(R) 

(^-ol^) /dR^o(R)^'(R) 


since is an eigenfunction of the Hamiltonian H. The advantage of the mixed expectation 
value (32) is that it does not require calculating the action of H on Tq. The integral in Eq. (32) 
involves the local energy of the trial wave function, £'l(R) = (R'I'(R))/'I'(R), and can be 
estimated in a similar way as in VMC by calculating the average of E]^ (R) on a sample of points 
Rfc representing the mixed distribution To(R)'I'(R)/ f dR'I'o(R)'I'(R). 

But how to access to the exact wave function Tq? Let us consider the action of the imaginary- 
time evolution operator (t —)• —it) on an arbitrary wave function such as the trial wave function 
T 

\^>{t)) = (33) 

where Et is for now an undetermined trial energy. Using the spectral decomposition of the 

evolution operator (written with the eigenstates Tj and the eigenenergies Ej of il), we see that 
the limit of an infinite propagation time is dominated by the state Tq with the lowest energy 
having a nonzero overlap with T 

lim |T(t)) = lim V= jirn ( 34 ) 

£—>-cx) £—>-cxD t^oo 

i 

since all the other states of energies Ei > Eq decay exponentially faster. The exponential 
g-fEo-Ex)* eliminated by adjusting Et to Eq, and we then obtain that 'l'(t) becomes 

proportional to Tq 

lim |T(t)) oc |To). (35) 

r—)-oo 

In position representation, Eq. (33) is written as 


T(Rf,t) = y’dRiG(Rf|Ri;t)T(Ri), 


(36) 


where G(Rf|Ri;t) = (Rf|e-(^-'®T)4|R.) is called the Green function (or the imaginary-time 
propagator from Ri to Rf). Multiplying and dividing by 'I'(Rf) and T(Ri), we obtain the 
evolution equation of the mixed distribution /(R, t) = 'I'(R, t)T(R) 

/(Rf, t) = J dRi G(Rf|Ri; t) (37) 

where (/(RfjRiU) is the importance-sampling Green function, 

G'(Rf|Ri;t) = 'I'(Rf) G(Rf|Ri;t) , , , (38) 

'h(Ri) 
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i.e. G(Rf|Ri;t) is G(Rf|Ri;i) similarity transformed by the diagonal matrix that has the values 
of 'h along the diagonal. In the limit of infinite time, the mixed distribution becomes proportional 
to the target stationary distribution: /(R) = limi_).oo/(R, t) oc 'ho(R)'I'(R). 

In practice, an analytical expression of the Green function is known only in the limit of a 
short propagation time, G'(Rf|Ri;r), where r is a small time step, and one must thus iterate to 
obtain the stationary distribution 


/(R)= lim /dRidR 2 ...dRMG(R|RM;r)G(RM|RM-i;T)...G(R 2 |Ri;r)^'(Ri) 2 . (39) 

M —>-00 J 

A short-time approximation to the Green function is obtained by applying the Trotter-Suzuki 
formula, + O(r^), where T and V are the kinetic and potential 

energy operators. In position representation, this approximation leads to the following expression 


G(Rf|Ri;r) 


1 

(27rr)3^/2" 


2 t e 


(V(RP+V(RP_^^)^ 


(40) 


where I4(R) is the potential energy. Similarly, assuming for now that the trial wave function 
is of the same sign in R, and Rf, i.e. 'h(Rf)/'I'(Ri) > 0, a short-time approximation to the 
importance-sampling Green function is [5, 31] 


G(Rf|Ri;r) 


1 


(Rf-Ri-v(Ri)r)^ ( gL(Rf)+^L(Ri) 

2 ^ e ^ ^ 



(41) 


where the drift velocity v(R) = V'I'(R)/'I'(R) and the local energy £'l(R) were assumed 
constant between R; and Rf. This short-time approximation implies a finite time-step error in 
the calculation of all observables, which should in principle be eliminated by extrapolating the 
results to r = 0 (see Refs. [32, 33, 34] for proofs that the time-step error vanishes in the r —)■ 0 
limit). 


2.2 Stochastic realization 

The stochastic realization of Eq. (39) is less trivial than for VMC. The Green function G(Rf|Ri; r) 
is generally not a stochastic matrix, since it does not conserve the normalization of the probabil¬ 
ity density: f dRf (^(RflRi; r) 1. We can nevertheless write the elements of G as the product 
of the corresponding elements of a stochastic matrix P and a weight matrix IT, 

(5(Rf|Ri; r) = P(Rf|Ri)IT(Rf|Ri), (42) 

where, in the short-time approximation, P(Rf|Ri) = (27rr)~^'^'^^ and 

IT(Rf|Ri) = e“h-®L(R.f)-l-SL(Ri))/2-ET)T_ i^j^at G reduces to a stochastic matrix in the limit 

T —)• Tq. The stochastic realization is then a weighted random walk. Start from a walker at 
an initial position Rf with a weight wi = 1, i.e., a distribution t(;i(5(R — Rf). Sample the 
position R 2 of the walker at the next iteration from the probability distribution P(R 2 |Ri) 
[according to Eq. (29)] and give it weight W 2 = IT(R 2 |Ri) x wi, sample the third position R 3 
from the probability distribution P(R 3 |R 2 ) and give it weight W 3 = IT(R 3 |R 2 ) x W 2 , and so 
on. After an equilibration phase, the random walk should sample the stationary distribution 
/(R) oc EK(i(R-Rfc)] « In reality, this procedure is terribly 

inefficient. Because the weights Wk are products of a large number of random variables, they 
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can become very large at some iterations and very small at other iterations. Consequently, the 
averages are dominated by a few points with large weights, even though the calculation of any 
point of the Markov chain takes the same computational time regardless of its weight. This 
problem can be alleviated by keeping the product of the weights for only a finite number n of 
consecutive iterations [35] 

k 

Wk= n (43) 

However, using a finite n introduces a bias in the sampled stationary distribution. In practice, 
for an n large enough to have a reasonably small bias, this procedure still remains inefficient. 

The solution is to use at each iteration k a population of walkers, with positions Rfc,a and 
weights Wk,a (where 0 = 1,2,..., M^), performing random walks with a branching or birth-death 
process designed to make the weights vary in only a small range from walker to walker in 
a given iteration, and from iteration to iteration, while still sampling the correct distribution 
/(R) oc E[X)^i rcfc,Q:(5(R - Rfc,a)] ~ (1/^) Various unbiased 

variants are possible, characterized by a population size that either varies or is constant 
from iteration to iteration, and by weights tCfc.o that can either be equal or different for each 
walker. 

The simplest variant uses a varying population size and weights all equal to one, Wk^a = 1- 
At each iteration k, each walker a is replaced by unit-weight copies of itself, where 
is an integer equal on average to what should be the current weight Wk^a = hh(Rfc,o|Rfc-i,o)- 
For example, if the walker a should have the weight Wk^a = 2.7 at iteration k, this walker is 
replaced by = 3 copies of itself with a probability 0.7 or replaced by mk^a = 2 copies 

of itself with a probability 0.3. More generally, mk,a = + 1 with probability Wk,a — 

a-iid ruk^a = L^fc,aJ otherwise, where is the nearest integer smaller than Wk^a- 

If = 0 the walker is terminated. This procedure does not change the sampled stationary 
distribution This variant has the disadvantage that the integerization of the weights results 
in unnecessary duplications of walkers, leading to more correlated walkers and thus to a smaller 
number of statistically independent points in the sample. Another disadvantage is that it leads 
to unnecessary fluctuations in the sum of the weights, a quantity that is relevant for computing 
the growth estimator of the energy. 

A better solution is the split-join algorithm [6] which limits the duplication of walkers by 
keeping residual noninteger weights Wk^a- At each iteration k, after updating the weights ac¬ 
cording to Wk,a = kT(Rfc,o|Rfc-i,a) X Wk-i,a, cach walker a with a weight Wk,a > 2 is split into 
\wk,a\ walkers, each being attributed the weight Wk^a/\wk^a\- H walkers a and P each have 
weight < 1/2, keep walker a with probability Wk,al{wk,a + and walker f3 otherwise. In 

either case, the surviving walker gets weight, Wk^a + Wjs^k- This algorithm has the advantage that 
it conserves the total weight of the population of walkers Wk = '^a=i '^k,a for a given iteration. 
Yet another possibility is the stochastic reconfiguration algorithm [36, 37], which uses a fixed 
population size Mk, and walkers of equal noninteger weights within each iteration, though the 
weights of the walkers fluctuate from one iteration to the next. 

To avoid the explosion or extinction of the population of walkers (or their weights if Mk 
is kept fixed), the value of can be adjusted during the iterations. For example, a choice 

^One can write: E lTj,.c5(R - R;=,„)] = E - Rfe,„)] = E 5(R - R*.+i,o)], 

where Rfe+i,c. are the positions of the Mk+i = uifc.o walkers used for the next iteration A: + 1 obtained after 

making mk,a. copies of the walker. 
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for £'t at iteration A: + 1 is ET^{k + 1) = E^^{k) — C\og{Wk/WQ) where £'q®*(A:) is an estimate 
of Eq at iteration A:, C is a constant, is the total weight of the population of walkers and 
Wq is the target total weight. Because of fluctuations, E'y thus slightly varies with respect to 
Eq during the iterations, which introduces a systematic bias on the weights and thus on the 
stationary distribution /(R). The adjustment of E'y makes /(R) too small in regions where 
^l(R) < Eq and too large in regions where £^l(R) > -^o- Both of these have the effect of 
raising the energy. This is called population-control error. This error is generally small and 
decreases with increasing number of walkers as l/M^ [6]. Besides, it is possible to eliminate 
almost completely this error by undoing the modihcation of weights introduced by the variation 
of E'l for the last several iterations [38, 6]. 

In the limit of an infinitesimal time step, the transition matrix P(Rf|Ri) has a stationary 
distribution T(R)^, and the weight term IT(Rf|Ri) converts this distribution into the mixed 
distribution 'I'o(R)'I'(R). One can get rid of the finite time-step error in the transition matrix 
P(Rf|Ri) by introducing an accept/reject step as in the Metropolis-Bastings algorithm [5]. For 
this, the transition matrix is redefined as P(Rf|Ri) = Pacc(Rf|Ri)Rprop(Rf|Ri), for Ri / Rf, 
with the proposal probability 


Fprop(Rf|Ri) = 


1 


and the acceptance probability 

Pacc(RflRi) = min 


(Rf-R;-V(R;)T)^ 

e 2 T 


^ Pp,op(Ri|Rf)^(Rf)^ ) 

’ Pp™p(Rf|Ri)T(Ri)2 


(44) 


(45) 


With this modification, P(Rf|Ri) has the stationary distribution 'I'(R)2 even for a finite time 
step. Of course, the finite time-step error persists in the term IT(Rf|Ri). Since certain moves 
are rejected, P(Rf|Ri) corresponds now to a process of diffusion with drift with an effective time 
step Teff < r. This effective time step can be estimated during the calculation from the average 
acceptance rate and it is consistent to use it in the term W (RfjRi) in place of r. In practice, just 
as in VMC, it is also more efficient in DMC to move the electrons one by one, i.e. to decompose 
P(Rf|Ri) according to Eq. (31). We then arrive at a DMC algorithm very similar to the VMC 
algorithm, with weights in addition. Note, however, that since a relatively small time step must 
be used in DMC, the average moves are smaller than in VMC and the autocorrelation time of 
the local energy is larger than in VMC. 

The energy is calculated as the average of the local energy over the distribution /(R) / f dR/(R). 
For M iterations (after the equilibration phase) and walkers, we have 


Eq ~ El 


Z^k=l 2-ja=l ^k,a 


(46) 


Just as in VMC, there is a zero-variance principle on the energy in DMC. In the limit that the 
trial wave function T is an exact eigenfunction of the Hamiltonian, El is independent of R, the 
weights reduce to 1, and the variance on El vanishes. 

Note that for an observable O that does not commute with the Hamiltonian, the average 
Ol over the mixed DMC distribution is an estimator of ('I'o|0|'I')/('I'o|'I') which is only an 
approximation to the exact expectation value ('I'o|0|'I'o)/(4'o|4'o) with anC)(||'k — Toll) error. 
Since the average Ol over the VMC distribution also has an error that is linear in HT — To|| 
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but with a prefactor that is twice as large, an 0(| — 'I'ol P) approximation is provided by twice 

the average of Ol over the mixed DMC distribution minus the average of Ol over the VMC 
distribution [39]. For a recent survey of exact methods for sampling the pure distribution 'I'q, 
see Ref. [40], and for a discussion of the techniques used for calculating pure expectation values 
of various classes of operators see Ref. [2]. 

2.3 Fermionic sign problem 

In Eq. (41), we have assumed that the trial wave function 'I'(R) is always of the same sign, 
i.e. that it does not have any nodes (points R so that ^(R) = 0). This is the case for the 
ground-state wave function of a Bosonic system, and for a few simple electronic systems (two 
electrons in a spin-singlet state, such as the ground state of the He atom or of the H 2 molecule). 
In this case, the algorithm presented above allows one to obtain the exact energy of the system, 
after elimination of the finite time-step error and the population-control error. If the wave 
function of the Fermionic ground state has nodes, then there is always at least one Bosonic state 
of lower energy, and the true ground state of the Hamiltonian is a Bosonic state for which the 
wave function 'I'b(R) can be chosen strictly positive. If one applied the Green function exactly, 
starting from the distribution 'I'(R)^ the distribution would correctly converge to 'I'o(R')'I'(R) 
since the trial wave function is antisymmetric (with respect to the exchange of two electrons) 
and has a zero overlap with all the Bosonic states which are symmetric. However, in reality 
one applies the Green function using a finite sampling in position space which does not allow 
one to impose the antisymmetry. Starting from an antisymmetric wave function 4', a small 
component of 'I'b can thus appear, and it grows and eventually dominates. The distribution 
tends to 'I'b(R)^(R) and the energy formula in Eq. (46) only gives 0/0 (the positive and negative 
weights cancel out) with statistical noise. Even if one imposed antisymmetry and eliminated the 
Bosonic states, e.g. by considering all electron permutations in each walker, the problem persists 
because different paths between the same points in this antisymmetrized space can contribute 
with opposite sign. Since 'I'o and —^0 are equally good solutions of the Schrodinger equation, 
the algorithm would sample each with approximately equal probability, leading again to the 
cancellation of positive and negative weight contributions. These are different manifestations of 
the infamous Fermionic sign problem. 


2.4 Fixed-node approximation 

To avoid the sign problem in DMG, the fixed-node approximation (EN) [28, 41, 29] is introduced. 
The idea is to force the convergence to a wave function approximating the Eermionic ground 
state by hxing its nodes to be the same as those of the trial wave function 'h (R) . Eormally, one 
can define the FN Hamiltonian, RfN; by adding to the true Hamiltonian H infinite potential 
barriers at the location of the nodes of ^(R) [42]. The ground-state wave function of this 
Hamiltonian is called the FN wave function and its energy is the FN energy EpN, 

Rfn|4'fn) = FfnI'^fn)- (4:7) 

In the 3AI-dimensional space of positions R, the nodes of 4'(R) define hypersurfaces of dimension 
3N — 1. The position space is then partitioned in nodal pockets of 'h(R), delimited by nodal 
surfaces, in which the wave function has a fixed sign. In each nodal pocket, the FN wave function 
is the solution to the Schrodinger equation satisfying vanishing boundary conditions on the nodal 
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surface. The FN Green function corresponding to the Hamiltonian FTfn is 

GFN(Rf|Ri; t) = (Rf|e-(^™-^T)t|R.)^ (48) 

and only permits moves Rj —)• Rf inside a nodal pocket. The importance-sampling FN Green 
function, 

G'FN(Rf|Ri;i) = 'k(Rf) GFN(Rf|Ri;i) (49) 

also confines the moves inside a nodal pocket, and it is thus always greater or equal to zero. 
A short-time approximation to GFN(Rf|Ri;i) is then again given by Eq. (41). The stochastic 
algorithm previously described can thus be applied directly. Thanks to the FN approximation, 
the weights always remain positive, and the stationary mixed distribution /(R) is proportional 
to TFN(R)'k(R)- 

The largest contributions to the finite time-step error come from singularities of the drift 
velocity v(R) = V'I'(R)/'I'(R) and of the local energy £'l(R) in the Green function of Eq. (41). 
Since the gradient of the trial wave function V'I'(R) (and of the exact wave function) does not 
generally vanish at the location of the nodes, the drift velocity v(R) diverges at the nodes, which 
leads to too large moves near the nodes for finite time steps. The drift velocity has discontinuities 
also at particle coalescences (both electron-nucleus and electron-electron). Similarly, for an 
approximate trial wave function 'I'(R), the local energy FiL(R) also diverges at the nodes and at 
particle coalescences (unless the Kato cusp conditions [43, 44] are imposed). The finite time-step 
error can be greatly reduced by replacing v(R) and FiL(R) in the Green function by approximate 
integrals of these quantities over the time step r [6]. 

If importance sampling is not used, it is necessary to kill walkers that cross the nodes of T 
to impose the FN boundary condition. In practice importance sampling is almost always used. 
In that case, it is better to reject the moves of walkers crossing the nodes, which is consistent 
with the FN approximation, but even this is not necessary since the number of walkers that 
cross the node per unit time goes to zero as r —)• 0 [6]®. For a finite time step, there are node 
crossing events, but these are just part of the finite time-step error and in practice essentially 
the same time-step error is obtained whether the walkers are allowed to cross nodes or not. 

We may wonder whether the walkers have to sample all the nodal pockets. The tiling 
theorem [45] establishes that all the nodal pockets of the ground-state wave function of a many- 
electron Hamiltonian with a reasonable local potential are equivalent, i.e., the permutations of 
any nodal pocket are sufficient to cover the entire space. This means that, for ground-state 
calculations, the distribution of the walkers over the nodal pockets is irrelevant. 

By applying the variational principle, it is easy to show that the FN energy is an upper 
bound to the exact energy 

_ (4'fn|-Rfn|4'fn) _ (4'fn|R|4'fn) . ^ 

(TfnI^fn) ('I'fnI'I'fn) ~ ° 

the second equality coming from the fact that the infinite potential barriers in Rfn do not 
contribute to the expectation value since Tfn is zero on the nodal surface. Since the wave 

®The drift velocity moves electrons away from the nodal surface, but for small r the diffusion term dominates 
and can cause walkers to cross nodes. The density of walkers goes quadratically to zero near nodes and walkers 
that are roughly within a distance yT can cross. Hence the number that cross per Monte Carlo step goes as 

x^dx ~ and so the number that cross per unit time goes to zero as yT. 
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function 'I'fn is an eigenfnnction of -f^FN) the FN energy can also be expressed nsing the mixed 
expectation valne 


('I'fnI^) ('I'fnI'I') 


(51) 


where the Hamiltonian Hfn has been replaced by H for essentially the same reason as before, 
viz., both and illFN are zero where Hp-^ is infinite. In practice, the FN energy is thus obtained 
by the same energy formnla (46). 


The accuracy of the DMC results with the FN approximation thus depends on the quality 
of the nodal snrface of the trial wave function. For a trial wave function with a single Slater 
determinant, the error due to the FN approximation can often be large, even for energy dif¬ 
ferences. For example, for the C 2 molecnle, the FN error for a single-determinant trial wave 
function is 1.6 eV for the total energy and 0.8 eV for the dissociation energy [46]. In order 
to reduce this error, one can nse several Slater determinants and optimize the parameters of 
the wave fnnction (Jastrow parameters, coefficients of determinants, coefficients that express 
the orbitals in terms of the basis functions, and exponents of the basis fnnctions) in VMC 
(see Refs. [47, 48, 49, 50, 51, 52, 53, 46]). This allows one to reach near chemical accuracy 
(~ 1 kcal/mol) in DMC for calculations of energy differences such as molecular atomization 
energies [54]. 
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Appendix: Statistical estimator of nonlinear functions of expec¬ 
tation values 


We often need to estimate nonlinear functions of expectation values. The simplest example is 
the variance, 

\[X]=E[X^]-E[Xf, (52) 

which is a quadratic function of the expectation values of two random variables X'^ and X. 
Another example is the calculation of the energy in DMC using weights [see Eq. (46)], with 
simplified notation, 

_ E[wEi^] 

E[w] 

involving a ratio of two expectation values. 



Consider a nonlinear function, /(E[A1], E[y]), of two expectation values, E[Ai] and E[y]. The 
usual simple estimator of /(E[A],E[y]) is f{X,Y) where 


and 


A 


1 

Mb 


Ew, 


6=1 


(54) 


Y 


1 

A4 




6=1 


(55) 


are averages over a finite number of blocks Mb, and Xf, and Yi, are the block averages of X 
and Y, respectively [see Eq. (6)]. As discussed before, each block average is itself an average 
over a sufficiently large number of steps, Mg, so that the block averages can be assumed to be 
independent of each other. The simple estimator can be justified as follows, (i) When the law of 
large numbers holds, X and Y converge, with increasing Mb, almost surely to E[A] and E[y], 
respectively, (ii) Hence, f{X,Y) converges to /(E[A],E[y]) provided that / is continuous at 
the point (E[A],E[y]). However, because / is nonlinear, f{X,Y) has a systematic error, i.e. 
E[f {X,Y)\ 7 ^ /(E[A],E[y]), that vanishes only in the limit of infinite sample size, Mb —)• oo. 
Though not necessary, in the following, for the sake of simplicity, we assume that f{X,Y) has 
a finite expectation value and a finite variance®. 


Systematic error 

Let us first consider the case of a nonlinear function f{x) of a single variable. By definition, 
the systematic error of the estimator f{X) is E[/(A)] — /(E[A]). The systematic error can be 
evaluated using a second-order Taylor expansion of the function f{X) around E[A] (assuming 
that / is at least a function in the neighborhood of E[A]) 

f(X) = f{E[X]) + (X - E[A]) + i (^0) (X - E[A])2 + ... , (56) 

^E[f{X ,Y)] can be undefined when / has a point at which it diverges, e.g., f{x,y) = x/y. In this case, this 
definition of the systematic error does not have a strict meaning. In practice, this is not a problem for this / 
provided that the absolute value of the expectation value of Y over a block is larger than a few times the square 

root of its variance, say, |i?[yb]| > 5y T[W]. 
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where the derivatives of / are evaluated at E[X]. If we take the expectation value of this 
expression, the linear term vanishes 


E[f{X)] = f{E[X]) + i (0) E[{X - E[X])2] + ... . 


(57) 


Assuming the random variable X has a finite variance and that the higher-order terms can be 
neglected, the systematic error is thus 


E|/(X)| -/(E|X1) = 1(0) V(X1 + ... 


l(f£] 

2 Vdx2 J Mb 


(58) 


Hence, the estimator f{X) has a systematic error with a leading term proportional to 1/Mb. 
Note that if the hypotheses (especially the finite variance) are not satisfied, the systematic error 
can decrease more slowly than 1/Mb. Equation (58) can easily be generalized to a function of 
several variables. For example, for two variables, the systematic error is 


E[f{X,Y)]-f{E[X],E[Y]) 


V[Xb] I (d^f\ 

2 \ dx"^ J Mb 2 \ 9 y 2 J 

: ( Cov[Xb,Fb] 

\dxdy) Mb 


v[n] 

Mb 


(59) 


where the second-order derivatives are evaluated at (E[A],E[y]). The leading neglected term is 
0(1/Af2) if the third moments of X and Y are hnite. The second-order derivatives in Eq. (59) 
can in practice be evaluated at {X, Y) without changing the order of the approximation. Hence, 
an estimator for /(E[A],E[y]) with an 0(1/M^) error is 


f{E[X],E[Y]) 


f(x FI - 1 - 1 XlXhl 

^ 2\dxy Mb 2\dyy Mb 
/ d^f \ Cov[Fb,Fb] 

\dxdy) Mb 


(60) 


where the second-order derivatives are evaluated at (AijE). 

This approach is general and can be used to recover some well-known unbiased estimators. 
For example, let us consider the covariance of two random variables 


Cov[X,y] = E[xy] - E[A]E[y] = f{E[XY],E[X],E[Y]), (61) 

for which f{x, y,z) = x — yz. In this case, the generalization of Eq. (59) to three variables with 
X = {l/M) Yliti Y = (1/M) X)Xi F where Xi and Yi are M uncorrelated realizations 

of X and Y, respectively, gives 

E[FF-XF]-Cov[A,y] = (62) 

which leads to the usual unbiased estimator for the covariance 

M 1 ^ 

Cov[.Y,r| » -^^(XY-XY) = .— Y_(X,-X)(Y.-Y). (63) 
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Statistical uncertainty 


First consider a function of a single variable. The statistical uncertainty of f{X) is given 
by cr[f{X)] = yJx[f{X)] where the variance of f{X) is X[f{X)] = E {f{X) -E[/(X)])^ 
Subtracting Eq. (57) from Eq. (56) gives 


f{X) - E[f{X)] = ( ^ 


(X-E[X]) + 


(64) 


Taking the square of this equation and the expectation value leads to the leading term in the 
variance of f{X) 

V[/(X)] = (^) V[X] + .... (65) 

This equation can be generalized to a function of several variables. For example, for two vari¬ 
ables, the variance of f{X,Y) is 

V[/(X.F)1 = (I)' V|X1 + (I)' V[F1 + 2 (I) (I) Cov[X,Fl +.... (66) 


Equation (66) can be used for estimating the variance of f{X,Y) at the cost of evaluating the 
variances V[X] and V[T] and the covariance Cov[X,y]. Note however, that it can give a severe 
underestimate of the error if df /dx and df /dy are small and Mb is not sufficiently large. 

There is a simple alternative for estimating the variance of / that does not suffer from this 
limitation and does not require calculating covariances. Consider again the case of a single 
variable. Instead of defining the block average of / in the obvious way, i.e. = f{Xi,), we 
define the block average of / as 


fi = f{Xi) for the first block 6=1 

fb = bf{X{b)) - (6 - l)fiX{b - 1)) for any block 6 > 2, (67) 

where ^(6) is the running global average up to block 6 


X(6) = t 

b' = l 

With this dehnition of the block average, it is easy to check that 


f{X) 


1 

Mb 




6=1 


( 68 ) 


(69) 


i.e. we have written f{X) as an average of random variables fj,. Provided that the variance of 
X is finite, the block average fi, introduced in Eq. (67) can be expanded as 

7, = f{E[X]) + (X, - E[X]) + • • • . (70) 

Assuming that / has a second-order Taylor expansion, the neglected term converges to zero in 
probability for large 6, at least as l/(6Ms). Therefore, according to Eq. (70), for large 6, the 
random variables converge to independent and equidistributed random variables (since the 
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random variables Xf, are)^. Consequently, the variance of f{X) can be estimated with the usual 
formula 

v[/mi.^.^(^|7r/(X)^), (71) 

This formula applies similarly for functions of several variables. The advantage of Eq. (71) 
over Eq. (66) for estimating the variance is that it is much simpler to implement and compute, 
especially for functions of many variables. The estimation of the variance can be simply updated 
at each block, just as for the expectation value. 


^The naive definition of the block average as /j, = f{Xf,) would also lead to Eq. (70) but the neglected term 
would not converge to zero for large b. 
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